Abstract

Cluster matching by permuting cluster labels is important in many clustering contexts such 
as cluster validation and cluster ensemble techniques. The classic approach is to minimize 
the Euclidean distance between two cluster solutions, which induces inappropriate stability
in certain settings. Therefore, we present the truematch algorithm that introduces two
improvements best explained in the crisp case. First, instead of maximizing the trace of
the cluster crosstable, we propose to maximize a $\chi^2$-transformation of this crosstable. Thus,
the trace will not be dominated by the cells with the largest counts but by the cells with 
the most non-random observations, taking into account the marginals. Second, we suggest 
a probabilistic component in order to break ties and to make the matching algorithm truly 
random on random data. The truematch algorithm is designed as a building block of the 
truecluster framework and scales in polynomial time. First simulation results confirm that 
the truematch algorithm gives more consistent truecluster results for unequal cluster sizes. 
Free R software is available. 
1. Introduction



Applying a cluster algorithm to a dataset results in fuzzy or crisp assignments of cases
to anonymous clusters. In order to interpret these clusters, we often wish to compare
them to other classifications, so some heuristic is needed to match one classification
to another. With the advent of resampling and ensemble methods in clustering (Gordon
and Vichi 2001; Dimitriadou et al. 2002; Strehl and Ghosh 2002), the task of matching
cluster solutions has become even more important: we need reliable and scalable matching
algorithms that do the task in a fully automated way.

Consider, for example, the use of bootstrapping or cross-validation for cluster validation
as suggested by many authors (Moreau and Jain 1987; Jain and Moreau 1988; Tibshirani
et al. 2001; Roth et al. 2002; Ben-Hur et al. 2002; Dudoit and Fridlyand 2002): many
cluster solutions are created and agreement between them is evaluated. Some agreement
indices do not need explicit cluster matching (Rand 1971; Hubert and Arabie 1985), but
others can only be applied after cluster solutions have been matched, for example, Cohen's
kappa (1960).



Recently, authors have suggested transferring the idea of bagging (Breiman 1996) to
clustering. Some approaches aggregate cluster centers (Leisch 1999; Dolnicar and Leisch
2000; Bakker and Heskes 2001) or aggregate consensus between pairs of observations (Monti
et al. 2003; Dudoit and Fridlyand 2003, BagClust2 algorithm). Other approaches aggregate
cluster assignments and, therefore, require cluster matching, for example, the crisp
BagClust1 algorithm of Dudoit and Fridlyand (2003), the combination scheme for fuzzy
clustering of Dimitriadou et al. (2002), or truecluster (Oehlschlagel 2007b).

Truecluster is an algorithmic framework for robust scalable clustering with model selection
that combines the idea of bagging with information theoretical model selection along
the lines of AIC (Akaike 1973, 1974) and BIC (Schwarz 1978). In order to calculate
its cluster information criterion (CIC), truecluster requires a reliable cluster matching
algorithm. The truematch algorithm presented here was designed to play that role. The
organization of the paper is as follows: in Section 2 we show an undesirable feature of the
standard approach to cluster matching. In Section 3 we present the truematch algorithm.
In Section 4 we demonstrate the benefits of the truematch algorithm within the truecluster
framework. In Section 5 we use simulation to compare truematch against standard trace
maximization matching, and in Section 6 we discuss our results.



2. What's wrong with trace maximization of the matching table 



The standard approach to cluster matching is searching for that permutation of cluster labels
that minimizes the Euclidean distance to a reference cluster solution. This criterion has been
applied to fuzzy (Gordon and Vichi 2001; Dimitriadou et al. 2002) and to crisp cluster
solutions (Strehl and Ghosh 2002). For crisp clusters, it amounts to trace
maximization of matching table counts: cross-tabulating class memberships of two solutions
and then permuting rows/columns of the matching table until the trace becomes maximal.
To our knowledge, cluster publications and software differ in the algorithms used to obtain
trace maximization, but do not question the Euclidean criterion per se.



For example, Dimitriadou et al. (2002) suggested a recursive heuristic to approximate
trace maximization. It is known that trying all permutations has time complexity $O(K!)$,
where $K$ denotes the number of clusters. The Hungarian method improves on this and
achieves polynomial time complexity $O(K^3)$. Kuhn (1955) published a pencil-and-paper
version, which was followed by J.R. Munkres' executable version (Munkres 1957) and
extended to non-square matrices by Bourgeois and Lassalle (1971). For a list of further
algorithmic approaches to this so-called linear sum assignment problem or weighted bipartite
matching, see Hornik (2005).



However, scalability is not the only quality aspect of a matching algorithm. An important
statistical feature of a matching algorithm is the following: if we match two random
partitions, the matching algorithm should not systematically align the two partitions. We 
now show that the classic trace maximization does not generally possess this feature. 

Assume a cluster algorithm that claims to identify an outlier in a sample of size $N = 100$
but which actually declares one case as 'outlying' at random. Now assume a procedure that
draws two bootstrap samples and clusters each of them into 99 'normal' cases and one 'outlier'.
In 1% of such procedures, the outlier picked in the second sample will randomly match the
outlier picked in the first sample. In such cases, trace maximization matching will lead to
a matching table as shown in Table 1. In the other 99%, there will be no match, which,
by trace maximization, gives a matching table like that shown in Table 2. The resulting
expected matching table is shown in Table 3.
         a      b
a       99      0
b        0      1

Table 1: Random matching (1%)





         a      b
a       98      1
b        1      0

Table 2: Typical trace maximization matching (99%)





             a         b
a       98.01%     0.99%
b        0.99%     0.01%

Table 3: Expected trace maximization matching



We can see that under random clustering, we expect 98.02% on the main diagonal, which
at first glance looks like a strong (non-random) match. Only applying a standard random
correction (Cohen 1960) confirms this to be a pure random match (Cohen's kappa = 0).
However, in a clustering context we have two objections against relying on such random
corrections. First, as far as evaluation of cluster agreement is concerned, random corrections
such as Cohen's kappa or Hubert and Arabie's corrected Rand index do not work properly,
because spatial neighbors have an above-random chance of being clustered together even in the
absence of any cluster structure in the data. Therefore, agreement indices are too optimistic
even with random correction. Second, and more importantly, in other contexts such as bagging
there is no random correction available at all. If cluster sizes are (very) different, bagging cluster
results will suffer because in standard trace maximization big randomly matched cells win
over small cells representing non-random matches. Therefore, we are looking for a matching
algorithm that does not systematically generate a strong diagonal under random conditions.
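The arithmetic behind the outlier example can be checked with a few lines of Python. The following is only an illustrative sketch: the helper names `mix_tables` and `cohen_kappa` are hypothetical and not part of any package mentioned in this paper. It mixes the two typical matching tables with their probabilities (1% and 99%) and computes the diagonal fraction and Cohen's kappa of the resulting expected table.

```python
def mix_tables(weighted_tables):
    """Weighted average of square count tables, returned as joint proportions."""
    k = len(weighted_tables[0][1])
    total = [[0.0] * k for _ in range(k)]
    for weight, table in weighted_tables:
        n = sum(sum(row) for row in table)
        for i in range(k):
            for j in range(k):
                total[i][j] += weight * table[i][j] / n
    return total


def cohen_kappa(p):
    """Cohen's kappa for a square table of joint proportions."""
    k = len(p)
    observed = sum(p[i][i] for i in range(k))
    row = [sum(p[i]) for i in range(k)]
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]
    chance = sum(row[i] * col[i] for i in range(k))
    return (observed - chance) / (1.0 - chance)


table_match = [[99, 0], [0, 1]]     # both outliers coincide (probability 0.01)
table_nomatch = [[98, 1], [1, 0]]   # outliers differ (probability 0.99)

expected = mix_tables([(0.01, table_match), (0.99, table_nomatch)])
diagonal = expected[0][0] + expected[1][1]
kappa = cohen_kappa(expected)
print(round(diagonal, 4), round(kappa, 6))
```

The diagonal fraction is large although the chance-corrected agreement vanishes, which is exactly the discrepancy described above.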



3. Truematch algorithm 

The problems with standard trace maximization described in the previous section result
from focusing on raw counts in a situation with unequal marginal (cluster) probabilities.
From other contexts, we know that this is not a good idea. Take the $\chi^2$-test for statistical
independence of two categorical variables. It is not based on raw counts. Instead, the
matching table of raw counts is transformed to another unit taking the marginals into
account. Let $N$ denote the total number of observations, $n_k$ the number of observations in
one row, $n_l$ the number of observations in one column and, finally, let $n_{k,l}$ denote the number
of observations in one cell of the $K \times K$ cluster crosstable. The first step in calculating
$\chi^2$ is to calculate for each cell the number of expected counts $\hat{n}_{k,l}$ under the assumption of
independence, where $p_k = n_k / N$ and $p_l = n_l / N$ denote the marginal proportions:

$$\hat{n}_{k,l} = p_k \cdot p_l \cdot N = \frac{n_k \cdot n_l}{N} \tag{1}$$

Then, using the expected counts from Equation (1), we transform the matrix of raw counts
into a matrix of normalized squared deviations $d_{k,l}$ from the null model:

$$d_{k,l} = \frac{(n_{k,l} - \hat{n}_{k,l})^2}{\hat{n}_{k,l}} \tag{2}$$

The $\chi^2$-value is defined as the sum of Equation (2) over all cells. If we restore the sign in
Equation (2), we get:

$$s_{k,l} = \operatorname{sign}(n_{k,l} - \hat{n}_{k,l}) \cdot d_{k,l} \tag{3}$$
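As an illustration of Equations (1) to (3), the following Python sketch computes the signed normalized squared deviations for a count crosstable. The function name `signed_chi2` is hypothetical, and returning zero for cells with zero expected count (an empty row or column) is an assumption made here, since the text does not cover that case.

```python
def signed_chi2(table):
    """Signed normalized squared deviations s[k][l] of a count crosstable."""
    n_total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    result = []
    for k, row in enumerate(table):
        out_row = []
        for l, count in enumerate(row):
            expected = row_sums[k] * col_sums[l] / n_total      # Equation (1)
            if expected == 0:
                # Assumption: a cell in an empty row/column carries no evidence.
                out_row.append(0.0)
                continue
            deviation = (count - expected) ** 2 / expected      # Equation (2)
            sign = 1.0 if count >= expected else -1.0
            out_row.append(sign * deviation)                    # Equation (3)
        result.append(out_row)
    return result


# Matching table of the outlier example when the two outliers differ.
s = signed_chi2([[98, 1], [1, 0]])
for row in s:
    print(["%.6f" % value for value in row])
```

For this table the two off-diagonal cells receive small positive values while both diagonal cells are negative, so the trace of the transformed table favors the off-diagonal matching even though the raw count of 98 dominates the diagonal.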

In order to cope with unequal cluster sizes, we suggest basing cluster matching on
maximizing the trace of $s_{k,l}$ rather than on maximizing the trace of $n_{k,l}$. And in order
to avoid any systematic alignment that is not based on the data, we add a probabilistic
component to the matching algorithm. Consequently, we define the truematch algorithm as:

1. Randomly permute rows and columns of the matching table

2. Transform the matching table counts $n_{k,l}$ to signed normalized squared deviations $s_{k,l}$
using Equation (3)

3. Apply a trace maximization algorithm like the Hungarian method to maximize the
trace (in fact, the Hungarian method minimizes $-s_{k,l}$)

4. Order the resulting row/column pairs descending by $s_{k,l}$, breaking ties at random
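The four steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation in the truecluster package: the names `signed_chi2` and `truematch` are hypothetical, and an exhaustive search over all permutations stands in for the Hungarian method, so the sketch has factorial runtime and is only suitable for small, square tables.

```python
import itertools
import random


def signed_chi2(table):
    """Signed normalized squared deviations, Equation (3); 0 if expected count is 0."""
    n_total = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    out = []
    for k, row in enumerate(table):
        out.append([])
        for l, count in enumerate(row):
            e = rows[k] * cols[l] / n_total
            d = 0.0 if e == 0 else (count - e) ** 2 / e
            out[k].append(d if count >= e else -d)
    return out


def truematch(table, rng=random):
    """Return matched (row, column) index pairs, ordered by decreasing s."""
    k = len(table)
    # Step 1: randomly permute rows and columns so that ties carry no order bias.
    rows = list(range(k))
    cols = list(range(k))
    rng.shuffle(rows)
    rng.shuffle(cols)
    shuffled = [[table[r][c] for c in cols] for r in rows]
    # Step 2: transform counts to signed normalized squared deviations.
    s = signed_chi2(shuffled)
    # Step 3: maximize the trace (brute force replaces the Hungarian method here).
    best = max(itertools.permutations(range(k)),
               key=lambda perm: sum(s[i][perm[i]] for i in range(k)))
    # Step 4: order pairs descending by s, breaking ties with a random key.
    pairs = [(s[i][best[i]], rng.random(), rows[i], cols[best[i]]) for i in range(k)]
    pairs.sort(reverse=True)
    return [(r, c) for _, _, r, c in pairs]


print(truematch([[98, 1], [1, 0]]))   # pairs the large cluster with the outlier
print(truematch([[99, 0], [0, 1]]))   # pairs equal labels when the outliers coincide
```

Because the two pairs in the first call have identical $s_{k,l}$, their order in the returned list varies from run to run, which is the probabilistic component of step 4.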

If no trace maximization algorithm like the Hungarian method is available, the matching
can easily be done using the truematch heuristic, which is similar to the heuristic suggested by
Dimitriadou et al. (2002):

1. Calculate signed normalized squared deviations $s_{k,l}$ for all remaining cells of the
matching table

2. Order all cells descending by $s_{k,l}$ and by $n_{k,l}$ (breaking ties at random) and denote
the first cell as the target cell

3. Match the row of the target cell to the column of the target cell

4. Remove the row and the column of the target cell from the matching table

5. If both the number of remaining rows and the number of remaining columns are at least two, repeat from step 1
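A possible Python rendering of the heuristic is given below. It is a sketch under two assumptions that the step list leaves open: the deviations are recomputed from the marginals of the remaining sub-table in every round, and a single remaining row/column pair is matched trivially at the end. The function names are hypothetical.

```python
import random


def signed_chi2(table):
    """Signed normalized squared deviations, Equation (3); 0 if expected count is 0."""
    n_total = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    out = []
    for k, row in enumerate(table):
        out.append([])
        for l, count in enumerate(row):
            e = rows[k] * cols[l] / n_total
            d = 0.0 if e == 0 else (count - e) ** 2 / e
            out[k].append(d if count >= e else -d)
    return out


def truematch_heuristic(table, rng=random):
    """Greedy matching: repeatedly pick the cell with the largest s (then n)."""
    rows = list(range(len(table)))
    cols = list(range(len(table[0])))
    pairs = []
    while len(rows) >= 2 and len(cols) >= 2:
        sub = [[table[r][c] for c in cols] for r in rows]
        s = signed_chi2(sub)                                   # step 1
        cells = [(s[i][j], sub[i][j], rng.random(), i, j)      # step 2: sort keys
                 for i in range(len(rows)) for j in range(len(cols))]
        _, _, _, i, j = max(cells)                             # target cell
        pairs.append((rows[i], cols[j]))                       # step 3
        del rows[i]                                            # step 4
        del cols[j]
    if len(rows) == 1 and len(cols) == 1:
        # Assumption: the last remaining row and column are matched trivially.
        pairs.append((rows[0], cols[0]))
    return pairs


print(truematch_heuristic([[98, 1], [1, 0]]))
print(truematch_heuristic([[99, 0], [0, 1]]))
```

The random number in the third position of each tuple only matters when both $s_{k,l}$ and $n_{k,l}$ are tied, which implements the random tie-breaking of step 2.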
It is obvious that the truematch algorithm has runtime complexity $O(K^3)$ like the
Hungarian method. The truematch heuristic also translates into polynomial runtime.
The number of residuals calculated to reduce the matching table from $K$ to $K - 1$ is $K^2$, thus
the total number of residuals calculated is

$$K^2 + (K-1)^2 + (K-2)^2 + \ldots + 2^2 = \frac{K \cdot (K+1) \cdot (2K+1)}{6} - 1$$

and, therefore, the truematch heuristic has runtime complexity $O(K^3)$ and memory
complexity $O(K^2)$ if the recursive nature of the algorithm is realized using a while-loop.



R package truecluster (Oehlschlagel 2007a) implements the truematch algorithm in
matchindex(method = "truematch") and the truematch heuristic in matchindex(method
= "tracemax") efficiently through underlying C-code.

Applying the truematch algorithm and the truematch heuristic to the above example
gives identical results: as in standard trace maximization matching, we find 1% random
matches as in Table 1, but for the 99% non-random matching cases, truematch generates
two versions of matching tables, see Table 4. Both versions have shifted the majority
of counts off-diagonal. Due to the probabilistic component in the 2nd step, this leads to
the expected matching (Table 5) that has a weak trace. Under truematch, only systematic,
non-random matches will result in a strong diagonal.





         a      b                  a      b
a        1     98         a        1      0
b        0      1         b       98      1

Table 4: Typical truematch (49.5% + 49.5%)





             a         b
a        1.98%    48.51%
b       48.51%     1.00%

Table 5: Expected truematch



We can quantify the benefit of truematch in this case by comparing expected values
of certain agreement indices, cf. Table 6. The Rand index (Rand 1971) and its random-corrected
version crand (Hubert and Arabie 1985) are invariant against row/column permutations
and, thus, do not differ. There is also virtually no difference for kappa (Cohen 1960).
However, the big difference is in the simple non-random-corrected diagonal fraction of
observations: while trace maximization misleadingly results in an expected diagonal close
to 1, truematch reduces the expectation of this non-random-corrected index to close to zero.
In the next two sections, we will explore the benefit of truematch in a bagging context,
where the main diagonal defines the matching but no random correction is available.
                              fraction   diagonal   kappa    rand   crand
Tracemax RandomMatch              1.0%       1.00    1.00   1.000    1.00
Tracemax NonRandomMatch          99.0%       0.98   -0.01   0.960   -0.01
Tracemax Expected               100.0%       0.98    0.00   0.961    0.00
Truematch Expected              100.0%       0.03    0.01   0.961    0.00
Truematch NonRandomMatch1        49.5%       0.02    0.00   0.960   -0.01
Truematch NonRandomMatch2        49.5%       0.02    0.00   0.960   -0.01
Truematch RandomMatch             1.0%       1.00    1.00   1.000    1.00

Table 6: Agreement statistics
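The expected diagonal fractions in Table 6 follow directly from mixing the typical matching tables with their probabilities. The short Python sketch below reproduces this arithmetic; the helper names `mix_tables` and `diagonal_fraction` are hypothetical and used only for illustration.

```python
def mix_tables(weighted_tables):
    """Weighted average of square count tables, returned as joint proportions."""
    k = len(weighted_tables[0][1])
    total = [[0.0] * k for _ in range(k)]
    for weight, table in weighted_tables:
        n = sum(sum(row) for row in table)
        for i in range(k):
            for j in range(k):
                total[i][j] += weight * table[i][j] / n
    return total


def diagonal_fraction(p):
    """Share of observations on the main diagonal of a table of proportions."""
    return sum(p[i][i] for i in range(len(p)))


# Trace maximization: Table 1 (1%) and Table 2 (99%).
tracemax = mix_tables([(0.01, [[99, 0], [0, 1]]),
                       (0.99, [[98, 1], [1, 0]])])
# Truematch: Table 1 (1%) and the two versions of Table 4 (49.5% each).
truematch = mix_tables([(0.01, [[99, 0], [0, 1]]),
                        (0.495, [[1, 98], [0, 1]]),
                        (0.495, [[1, 0], [98, 1]])])

print(round(diagonal_fraction(tracemax), 4))
print(round(diagonal_fraction(truematch), 4))
```

Rounded to two digits, the two results correspond to the 'diagonal' entries of the rows 'Tracemax Expected' and 'Truematch Expected'.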



4. The role of truematch in truecluster 

The truecluster concept (Oehlschlagel 2007b) suggests a cluster information criterion (CIC)
that evaluates for each cluster model (for each number of clusters) an $N \times K$ matrix $P$ that
aggregates votes over many resamples. $P$ is created by the multiple match cluster count
(MMCC) algorithm using the truematch algorithm as follows:



1. Create an $N \times K$ matrix $C$ and initialize each cell $C_{i,k}$ with zero

2. Take a resample (with replacement) of size $N$ and use a base cluster algorithm to fit
the $K$-cluster model $c^*$ to the resample. Then, use a suitable prediction method to
determine cluster membership of the out-of-resample cases to get a complete cluster
vector $c$ with $N$ elements $c_i$

3. For each row in $C$ add one vote (add 1) to the column corresponding to the cluster
membership in $c$

4. Repeat step 2

5. Estimate cluster memberships $\hat{c}$ by row-wise majority count in $C$ (breaking ties at
random), use the truematch algorithm or heuristic to align $c$ with $\hat{c}$, and rename the
clusters in $c$ like the corresponding clusters in $\hat{c}$

6. For each row in $C$ add one vote (add 1) to the column corresponding to the cluster
membership in $c$

7. Repeat from step 4 until some reasonable convergence criterion is reached

8. Divide each cell in $C$ by its row sum to get a matrix of estimated cluster membership
probabilities $P$
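The voting loop of the MMCC steps can be sketched in Python as follows. Several simplifications are assumptions of this sketch and not part of the algorithm as published: resampling, model fitting and prediction (step 2) are hidden behind a user-supplied callable `draw_clustering`, a fixed number of votes `n_votes` replaces the convergence criterion of step 7, and the matching uses an exhaustive search over permutations of the signed deviations instead of the Hungarian method. All function names are hypothetical.

```python
import itertools
import random


def signed_chi2(table):
    """Signed normalized squared deviations, Equation (3); 0 if expected count is 0."""
    n_total = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    out = []
    for k, row in enumerate(table):
        out.append([])
        for l, count in enumerate(row):
            e = rows[k] * cols[l] / n_total
            d = 0.0 if e == 0 else (count - e) ** 2 / e
            out[k].append(d if count >= e else -d)
    return out


def match_labels(reference, labels, k, rng=random):
    """Rename `labels` so that they align with `reference` (brute-force truematch)."""
    table = [[0] * k for _ in range(k)]
    for ref, lab in zip(reference, labels):
        table[ref][lab] += 1
    s = signed_chi2(table)
    perms = list(itertools.permutations(range(k)))
    rng.shuffle(perms)   # ties between equally good permutations break at random
    best = max(perms, key=lambda perm: sum(s[i][perm[i]] for i in range(k)))
    mapping = {best[i]: i for i in range(k)}   # label in c -> label in c_hat
    return [mapping[lab] for lab in labels]


def mmcc(n, k, draw_clustering, n_votes=50, rng=random):
    """Aggregate n_votes complete cluster vectors into an n x k probability matrix."""
    votes = [[0] * k for _ in range(n)]                        # step 1
    for i, lab in enumerate(draw_clustering()):                # steps 2 and 3
        votes[i][lab] += 1
    for _ in range(n_votes - 1):                               # steps 4 to 7
        c = draw_clustering()
        c_hat = []
        for row in votes:                                      # step 5: majority vote
            top = max(row)
            c_hat.append(rng.choice([j for j, v in enumerate(row) if v == top]))
        for i, lab in enumerate(match_labels(c_hat, c, k, rng)):   # step 6
            votes[i][lab] += 1
    return [[v / sum(row) for v in row] for row in votes]      # step 8


# Toy usage: a perfectly stable 2-cluster structure with arbitrary label switching.
truth = [0] * 5 + [1] * 5


def noisy_labels():
    flip = random.random() < 0.5
    return [1 - t if flip else t for t in truth]


P = mmcc(10, 2, noisy_labels, n_votes=20)
print(P[0], P[9])
```

In this toy case the label switching is fully undone by the matching, so every row of $P$ is crisp; for unjustified clusters the random tie-breaking would instead spread the votes across columns.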

Table 7 summarizes simulations with truecluster versus consensus clustering (100 cases,
10,000 replications; for details see MMCCconcensus.r in R package truecluster (Oehlschlagel
2007a); the table is sorted and grouped by the magnitude of CIC values). For random data
without cluster structure, we would expect a very 'fuzzy' $P$ without clear preferences for any
cluster. Furthermore, we would expect CIC to increase for models with more true clusters
and to decrease if models try to distinguish more clusters than justified by the data.

Table 7 shows that the MMCC algorithm using truematch delivers on this expectation:
CIC increases for justified clusters and declines for unjustified ones, even if unjustified
clusters in the model are small. This works because once cluster decisions are unjustified, the
truematch algorithm starts distributing its votes randomly across indistinguishable columns
of $C$ and, thus, 'fuzzifies' $P$. Compare that to consensus clustering (Dimitriadou et al.
2002) based on trace maximization obtained with R package clue (Hornik and Boehm
2007; Hornik 2005). Models with unjustified small clusters get CIC values as high as
models without the unjustified cluster. This is a consequence of the trace maximization
matching, adding inappropriate stability to the voting. Take, for example, the "random
99:1" model, which is as unjustified as the "random 50:50" model but receives a much
higher CIC value. The stability induced by the trace maximization matching results in
quite a crisp $P_2$: for each row, we find high probability for one cluster and low probability
for the other. If we assign cases to clusters based on the maximum probability per row
in $P$, all cases are assigned to the same cluster. Such a degenerate $P$ is not wrong but
unfortunate. If we manually analyze $P_2$, we might detect that $P_2$ actually represents a
one-cluster ($K = 1$) model. But if we are after automatic selection of models (number of clusters),
it is misleading that $P_2$ does not represent $K = 2$ but $K = 1$. Analyzing a consensus cluster
solution $P_K$ for degeneracies does not really help: the estimated probabilities can be biased
even before the matrix formally degenerates.
MMCC                        true K   model K       H     RMC       I      CIC
random 50:49:1                   1         3   1.0??   0.020   0.044   -1.0?4
random 99:1                      1         2   1.000   0.010   0.014   -0.985
random 50:50                     1         2   0.995   0.010   0.059   -0.936
single 100                       1         1   0.000   0.000   0.000    0.000
justified 50 random 49:1         2         3   0.499   0.018   0.695    0.196
justified 50:50                  2         2   0.000   0.010   0.990    0.990

consensus                   true K   model K       H     RMC       I      CIC
random 50:49:1                   1         3   1.066   0.011   0.049   -1.016
random 50:50                     1         2   0.995   0.010   0.048   -0.947
random 99:1                      1         2   0.081   0.001   0.001   -0.080
single 100                       1         1   0.000   0.000   0.000    0.000
justified 50 random 49:1         2         3   0.071   0.011   0.965    0.895
justified 50:50                  2         2   0.000   0.010   0.990    0.990

true K                      true number of clusters
model K                     model number of clusters
H                           model uncertainty
RMC                         relative model complexity
I                           model information
CIC                         cluster information criterion (I-H)
single 100                  theoretical values for single group (no cluster)
random 50:50                random clustering with 2 equal sized clusters
random 99:1                 random clustering with 2 unequal sized clusters
random 50:49:1              random clustering with 3 unequal sized clusters
justified 50:50             justified clustering with 2 equal sized clusters
justified 50 random 49:1    2 justified clusters, one randomly split unequal sized

Table 7: consensus cluster vs. truecluster
5. Simulation results

In order to systematically investigate the consequences of the different features of truematch
versus simple trace maximization matching, we have carried out extensive simulations within
the truecluster framework: we assume two clusters and vary their relative size $p$ and the
reliability $\kappa$ of a fictitious clustering algorithm and compare the truecluster results gained
via trace maximization versus truematch. We did two versions of the simulations: in the
non-fixed version, $p$ just determines sampling probabilities; in the fixed version, the fictitious
clustering algorithm enforces the exact relative size $p$ of the two clusters. Details of
the simulation are given in Appendix A.

Figure 1 shows information, uncertainty, and their difference CIC for the non-fixed
simulations. White areas denote simulation trials where the truecluster algorithm degenerated
from a 2-cluster solution to a 1-cluster solution. The most notable difference is the big
share of non-converged truecluster solutions using trace maximization, compared to the
truematch algorithm. The estimated information, given reliability and skewness, is very
similar and reasonable: information is highest for $p = 0.5$ and $\kappa = 1.0$ and is lower for both
reducing $\kappa$ and/or skewing $p$.

By contrast, for uncertainty and for the CIC, trace maximization and truematch
differ dramatically. Using trace maximization, the uncertainty estimate does not
only depend on $\kappa$ but is also artificially lower for higher skewness. As a consequence, cluster
models with unequal cluster sizes get better CIC values than cluster models with equal
cluster sizes. Using the truematch algorithm almost avoids this undesirable pattern: the
estimated uncertainty depends almost only on $\kappa$, not on $p$. The estimated CIC shows a
very reasonable pattern: at high $\kappa$, the CIC is highest for equal sized clusters (conforming
with the entropy principle); at low $\kappa$, the CIC is low, however skewed $p$ is. Only at very
extreme $p$ is the CIC biased downwards: too small clusters cannot be detected with too
small a sample size. Extreme models are non-identifiable and the uncertainty estimate has
high variance. Keep in mind that 'extreme' $p$ corresponds to very few cases at a sample
size of $N = 100$. The fixed simulations gave similar results (Figure 2).

In summary, trace maximization fails to estimate uncertainty independent of skewness 
and tends to overestimate CIC for unequal cluster sizes or fails to converge. This restricts 
its usefulness for cluster evaluation and bagging. By contrast, the truematch algorithm 
works at almost any combination of reliability and skewness (with the exception of non- 
identifiable models, given the sample size). 
6. Discussion

We have shown that trace maximization matching fails to behave sufficiently neutrally when
matching clusterings. The problem arises generally but is especially important in contexts
where random correction is not applicable. As an alternative, we have presented the truematch
algorithm and heuristic, both of which probabilistically generate neutral expected matching
tables and scale in polynomial time. Our simulations have confirmed that truematch avoids
unjustified (expected) matchings induced by unequal cluster sizes. For the simulations done
here, the truematch algorithm and the truematch heuristic behave identically. Since the
truematch heuristic does not guarantee maximizing the $\chi^2$-criterion, we expect the truematch
algorithm to be superior. However, there is a subtle difference: while the matching
of the truematch algorithm depends solely on $s_{k,l}$, the truematch heuristic uses $s_{k,l}$ and $n_{k,l}$
to select the row/column matches. Therefore, a final decision about an optimal matching
algorithm needs more investigation.

Truematch is central to the MMCC algorithm, which creates the basis for the CIC- 
evaluation in the truecluster framework and, thus, contributes to solving the decade-old 
problem of choosing the optimal number of clusters. Beyond that, cluster bagging, in 
general, could benefit from using truematch: the resulting $N \times K$ matrix is fuzzified rather
than degenerate for unjustified cluster splits. This allows for better automated processing
of such results. It is an open question whether the truematch algorithm also has advantages 
for consensus clustering, or whether different usages of cluster ensembles require different 
matching algorithms. 
Appendix A.

In this appendix, we give details concerning the simulations in Section 5. Assume a vector $x$
of length 100 with 'true' sample group memberships where $p$ denotes the fraction of 1 and
$(1 - p)$ the fraction of 0. Let $p_1$ denote the matrix of joint probabilities for a case's true and
clustered classification when the cluster algorithm perfectly separates 0 from 1 (at $\kappa = 1$).

$$p_1 = \begin{pmatrix} (1-p) & 0 \\ 0 & p \end{pmatrix}$$

Let $p_0$ denote the matrix of joint probabilities for a case's true and clustered classification
when the cluster algorithm makes a random guess when separating 0 from 1 (at $\kappa = 0$).

$$p_0 = \begin{pmatrix} (1-p)^2 & (1-p) \cdot p \\ (1-p) \cdot p & p^2 \end{pmatrix}$$

Then $p_\kappa$ denotes the matrix of joint probabilities for a case's true and clustered
classification when the cluster algorithm has reliability $\kappa$.

$$p_\kappa = \kappa \cdot p_1 + (1 - \kappa) \cdot p_0$$

The two conditional probabilities $p_{id}$ that the clustering algorithm identifies the true
class, given the true class, are

$$p_{id} = \kappa + (1 - \kappa) \cdot \begin{pmatrix} (1-p) \\ p \end{pmatrix}$$
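The relation between the joint probability matrix and the two conditional identification probabilities can be verified numerically. The Python sketch below is illustrative only; the function names are hypothetical, and rows are taken to index the true class (0, 1) and columns the clustered class, which is an assumption about the layout of the matrices above.

```python
def joint_probabilities(p, kappa):
    """Matrix p_kappa = kappa * p_1 + (1 - kappa) * p_0 (rows: true class 0, 1)."""
    p1 = [[1 - p, 0.0], [0.0, p]]
    p0 = [[(1 - p) ** 2, (1 - p) * p], [(1 - p) * p, p ** 2]]
    return [[kappa * p1[i][j] + (1 - kappa) * p0[i][j] for j in range(2)]
            for i in range(2)]


def identification_probabilities(p, kappa):
    """Probability of assigning the true class, given true class 0 and 1."""
    return [kappa + (1 - kappa) * (1 - p), kappa + (1 - kappa) * p]


p, kappa = 0.3, 0.5
pk = joint_probabilities(p, kappa)
# Conditioning the diagonal of p_kappa on the true class gives p_id.
from_joint = [pk[0][0] / (1 - p), pk[1][1] / p]
direct = identification_probabilities(p, kappa)
print([round(v, 4) for v in from_joint], [round(v, 4) for v in direct])
```

Both routes give the same pair of probabilities, which confirms that $p_{id}$ is the diagonal of $p_\kappa$ divided by the marginal probability of the true class.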



For each value of $p \in \{1/100, 2/100, \ldots, 99/100\}$ and each value of $\kappa \in \{0.00, 0.01, 0.02, \ldots, 1.00\}$,
we simulate aggregation of 1000 bootstrap samples from $x$. For each bootstrap sample, our
fictitious cluster algorithm assigns cases with probability $p_{id}$ to the true class and with
probability $1 - p_{id}$ to the other class. The resulting cluster memberships $c^*$ are matched
versus the (current) estimated cluster memberships $\hat{c}$ of the cases in the bootstrap sample.
If $c^*$ or $\hat{c}$ does not contain two classes, the bootstrap sample is dropped and replaced by
another one. Differently from the MMCC algorithm in Section 4, we do not predict cluster
memberships of the out-of-bag cases. We use $c^*$ directly instead of $c$; consequently, the rows
of $C$ are not guaranteed to have aggregated an equal number of votes. For all combinations
of $p$ and $\kappa$ (the resulting $99 \times 101$ truecluster models $P$) we calculate information,
uncertainty, and CIC (Oehlschlagel 2007b). These values are visualized using color coding, and
contour lines are added based on a loess smooth. To create the fixed version, the complete
procedure is repeated, additionally enforcing a fixed fraction $p$ by moving randomly selected
observations in $c^*$ from the too big group to the too small one, analogous to a cluster
algorithm that forces certain cluster sizes. The R code doing the simulation is available in
truematch.r in package truecluster (Oehlschlagel 2007a).